Automatic Transcription Verification of Broadcast News and Similar Speech Corpora
Abstract
In the last few years, the focus of ASR research has shifted from the recognition of clean read speech (e.g. WSJ) to the more challenging task of transcribing found speech such as broadcast news (Hub-4 task) and telephone conversations (Switchboard). Because found speech is more difficult to transcribe, the available training corpora tend to be larger and to contain more errors than before. In this paper we present a method for automatically detecting faulty training scripts. Using the Hub-4 task, we report on the efficiency of error detection with the proposed method and investigate the effect of both manually and automatically cleaned training corpora on the word error rate (WER) of the RWTH large vocabulary continuous speech recognition (LVCSR) system. This work is a joint effort of the University of Technology (RWTH) and Philips Research Laboratories Aachen, Germany.
Similar resources
Automatic verification of broadcast news transcriptions
In this paper we present a method for automatically detecting erroneous training scripts for speech corpora like Broadcast News and Switchboard. Based on the Hub-4 task we will report on the performance of error detection with the proposed method and investigate the effects of both manually and automatically cleaned training corpora on the performance of the RWTH speech recognition system. Our ...
An Analysis of Sentence Segmentation Features for Broadcast News, Broadcast Conversations, and Meetings
Information retrieval techniques for speech are based on those developed for text, and thus expect structured data as input. An essential task is to add sentence boundary information to the otherwise unannotated stream of words output by automatic speech recognition systems. We analyze sentence segmentation performance as a function of feature types and transcription (manual versus automatic) f...
A Lightweight on-the-fly Capitalization System for Automatic Speech Recognition
This paper describes a lightweight method for capitalizing speech transcriptions. Several resources were used, including a lexicon, newspaper written corpora and speech transcriptions. Different approaches were tested both generative and discriminative: finite state transducers, automatically built from Language Models; and maximum entropy models. Evaluation results are presented both for writt...
Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations
Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units to the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and...
Toward Automatic Recognition of Japanese Broadcast News
In this paper we report on automatic recognition of Japanese broadcast-news speech. We have been working on large-vocabulary continuous speech recognition (LVCSR) for Japanese newspaper speech transcription and achieved reasonably good performance. We have recently applied our LVCSR system to transcribing Japanese broadcast-news speech. We extended the vocabulary to 20k words and trained the lan...
Publication date: 1999